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Mathematical Algorithms to Maximize Performance in Numerical Weather Prediction 
Introduction 


Numerical weather prediction models, which involve the solution of non-linear 
partial differential equations at points on an extensive three-dimensional grid, 
are ideally suited for processing on vector machines. It was logical therefore 
that the new global forecast model to be implemented at the Meteorological Office 
should be written in vector code for the Cyber 205. 

In order to achive full efficiency and to reduce storage requirements the 
model used 32-bit arithmetic which had been found to provide high enough precision. 
Unfortunately, however, the trigonometrical and logarithmic functions provided 
by CDC could only handle 64-bit vectors and, although written in efficient scalar 
code, did not take advantage of the special facilities of a vector processor. It 
was therefore necessary to rewrite the functions in vector code to handle both 
32 and 64-bit vectors. There was also no half-precision compiler available for 
the Cyber 205 at that time and so the functions, like the model, had to make 
extensive use of the "special call" syntax. This made the code more difficult to 
write but it allowed much greater flexibility in that it became possible to access 
the exponent of a floating-point number independently of its coefficient. 

This paper presents a description of the techniques and it summarises the 
results which were achieved. One example, the logarithmic function, is treated 
here in detail to illustrate the general approach to the problem. 

Derivation of logarithms 

The coding for the logarithm function illustrates both the use of the way in 
which floating-point numbers are stored and the use of linked triads to gain 
additional speed. 

To calculate us lotg&e) we divide the range of x into two. the first of 
which is 

a) x, ^ Vi* and rc. < 

2 . 

We first write the value of x in a way which can be related to the format 
of stored floating-point numbers. Thus, introducing two new unknowns n and , 
r\ being an integer and _!_ < ui< | , we may write any number as re. - 7?*). 

Z ^ - 

Now the Cyber 205 stores the floating-point number as 


factor 2? 


Z^ P . COQ$iCMlSlt - 2^ P . Z*. ft 

is introduced by normalization. 


where the 


Since for logarithms, x. must always be positive, for 64-bit numbers bit 17 
will be on, so j =46 and for 32-bit numbers bit 9 will be on, so j = 23. 

Then relating the two, we have n^&xpj - j and Usk. 
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As an example, if x = 2.0 as a 64-bit normalized value 


so from the above formulae 


n - - +■$' + * I mnri w }9>-0 

Here, we can obtain the values of n and uf very easily as we can access 
the exponent and coefficient of a number by using special calls. 

The next step is to convert the functions into a suitable form for vectorization 
and this involves the introduction of a new variable 

/ «*- Vg/a. \ 

1 u> + -VJ/W which can be computed at the same 

time as u . 

Thasi <*}■:/ 1 1 g \ 

l l -2 / V 

From the original definition 

V »-* 

thus \OfaX.= £)'°S<’ Z 

b) For the remaining valu 
value of a- is defined by: 



t> - X.-I 
a > 1 


Then 


lo f* 


loj e _Lz* 


I •f-a. 


so that 


for 


Jts- i ± t 
t-a 

VT < 3t< Vz 1 . 
a. ~~ 

1 04 


»4 

1-4 


In each case, the problem then becomes one of vectorizing 8 ft (“ 
is easily done by replacing it with a truncated series which gives the required 
degree of precision: 


which 


l0 3« 


(£r) - L 

mmO 




».*«*♦ 1 


where the constants are known. 

Then log tf ^ i-rt - 

(((ICC c fc a a + c*) * x -f c 3 )*%Ca,) c,) a a + c* ) a 

Despite its complicated appearance, this reduces to eight vector operations 
consisting of a multiplication, six linked triads and a f inal multiplication 
by t thus 
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etc 


Multiplication to gire Z 1 

First triad = VI = c-f 

Second triad = V2 = v v t- x ^ 

Third triad = VJ = Y2. -*-C 9 

Tests, using the 1.5 compiler, and a range of vector lengths gave the 
following results, with times being expressed in units of 10”4 seconds. 


Vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC logarithms 

.3 

• 55 

.7 

1.01 

2.00 

3.66 

7.04 

21.50 

64-bit vector 
logarithm 

.47 

.61 

.78 

1.12 

2.16 

3.87 

7.47 

20.15 

32-bit vector 
logarithm 

.53 

.57 

.65 

.82 

1.34 

2.20 

3.99 

9.66 


The first point to notice here is that the full increase in speed for 
32-bit vectors is only achieved with large vector lengths. Because of the 
overheads associated with the initiation of vector instructions, this is not 
unexpected and is common to all of the functions to be described. What is 
unexpected is that no improvement in speed was achieved for our 64-bit function 
when compared to the CDC function. In this respect, this function is unique 
among all those treated in this paper. However, the original aim of producing 
a 32-bit version has been successfully achieved. 


Exponentials 

The exponential function is derived from the standard formula 

chosen to make use of special calls, k, m and f 


e*. z* .2”'* 

are defined as follows: 
If r\ = uvfc 


0*1 

L 


then 

and 


log, 

“[ft J 

k r V\b 


and mn o modulo 16 for x-y,0 

and < r ' = 14 — n modulo 16 for yc<0 


ffl-' 

{ = (H±\ -n 

\ toje 2 - ) 

m/ifc 

Now, since is integer and O £ rr\ < 16 , the factor 2. is 

obtained from a look-up table of 16 elements of known values, using the "special 
call" instruction Q8VXT0V . 


Having found the integer ft. from the above formula, and Z from the 

look-up table, to obtain the value Z* ■ = Z*'* rn,t<m we add fc to the 

exponent part of by using special calls. 
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The factor, 


is given by 


2 ?'* 

= Pi ^ 3 * £** P.£ « f» 

where f is obtained as above and P», Pi > P? are known constants. 

C j _ |H4 | £ 

Then, to obtain 2- all we need is a final multiply of 2f by 2 

^ The following results were achieved, times again being given in units of 
10 seconds. 


vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC exponential 

• 35 

.7 

.93 

1.44 

2.86 

5.25 

10.52 

33.36 

64-bit vector 

.47 

.6 

.78 

1.14 

2.29 

4.15 

7.97 

22.75 

exponential 
32-bit vector 

.47 

• 56 

.68 

.93 

1.85 

3.14 

5.85 

14.62 


exponential 

Here, for a vector length of 5000 the 52-bit exponential routine is only 
kQ% faster than the 64-bit routine because of the use of the "special call" 
Q8VXT0V. However the 64-bit routine has achieved a considerable speed-up over 
the GDC exponential. 

The Hyperbolic functions 

The routines to calculate the hyperbolic functions u~ caikx u; 5 uihx. 
and 'cash X use the following formula, ; 

cosk oc - 2 £ 

x x 

The calculation of «• is as described earlier. During the calculation of e. , 

little extra work is required to obtain e'° t which avoids the need to call the 

exponential routine twice. 

The hyperbolic sine is given by 


and 


SU\ h X ; i. ( ) 

for 

~ a.f*'-* 1 

54/thX.c / 50 

rose 

for 


)cc|£ 0.5 


\-x. \ < 0.5 


Here the two distinct cases are treated independently, so that we are dealing 
with shorter vector lengths, and then the results are merged together at the end 
of the routine. The polynomial expansion of sinh x can be performed in seven 
vector instructions, by using linked triads. 

The hyperbolic tangent is given by 


tiUlh 3C 


■£ 

mso 




2.nr\-r » 


for Q< |sc.| £ o. 
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tan hx r 

1 - 3- 

for 

0- 12. < 1 8.0 




fcxnUx r 

1. O 

for 

X y iS.O 

ihsC «. 

-1.0 

for 

x < - 1 9.0 


Again, the distinct cases are treated independently so that we are dealing 
with shorter rector lengths, and again we can use linked triads when calculating 
the polynomial expansion of ranhsc . 

The timings of the hyperbolic sine and hyperbolic tangent routines are data 
dependent, but some sample timings are given below. All times are expressed in 
units of 10 _i+ seconds. 


vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

hyperbolic 

cosine 

64-bit vector 

.55 

.79 

1.08 

1.68 

3.45 

6.41 

13.26 

37.65 

32-bit vector 

• 54 

.69 

.88 

1.27 

2.44 

4.44 

8.72 

22.99 

hyperbolic 

sinh 

64-bit vector 

.75 

.99 

1.30 

1.96 

3.88 

7.27 

14.87 

43.85 

32-bit vector 

.72 

.87 

1.07 

1.48 

2.74 

5.00 

9.47 

24.38 

hyperbolic 
tangent 
64-bit vector 

.66 

.87 

1.15 

1.68 

3.33 

6.01 

11.79 

34.83 

32-bit vector 

.64 

.73 

.89 

1.21 

2.30 

3.66 

6.87 

17.76 

Again, we see 

that for 

very 

short 

vector lengths we 

do not 

have a 


advantage by using 32-bit vectors, but for longer vector lengths we are approaching 
twice the speed of the 64-bit functions. There were no CDC functions available to 
compare with our results. 

Sines and cosines 


The trigonometrical functions, Si> »x and yzcosx, are calculated from 

the polynomial expansion of tCnx so that we can make use of linked triads 
again. First the input argument needs to be reduced modulo 2.1T . This is achieved 
by 


letting 
then put 


•r, s z I xl 

^ T 


and 


\nt 


[-Nr 1 ] 


so that I • 


and £ a •»-£ modulo 4 


So 


tin (vc) i a 

SC ns cs 


given by 
sin a 

for 

ksO 

sin (i - a) 

for 

ksl 

-Sen a 

for 

2 

- Sin (i -l) 

for 
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for 64-bit function 


9 

Z <*•!'*■ I 

where su\i. = / ^ * 

and the constants C m are known. 

Because the values Cj. and are too small to affect the accuracy of the 

32-bit function results: 



& 

iU\t - / 

2sn-r-l 

for 32-bit vector function 


The cosine function 

is given by 




C06X. = iih ^ 

) 

where ^-4- + *■) is calculated 

as above. 

If it is 
radians , 

known that 
much work 

the input operand, x, is always between -2-TT and 
can be left out of the routine; 

+ a.7T 

for as above let 

r, = Z |*| 
'7T~ 

and -iv - Jrtfe I" z |sci 

L TT i 





and 5o O ^ £ 3 





and k-T^ module 4 * 

*a. 



and again 2 . 

= T, - fj. 90 O * * < / 


So for 





for 


r,- l 

J 5U* X. SCrt(j - 2 .) = 5l>l 


for 


2 * a. 

> - ~s^O> - s*n£a.— r,) 


for 


2 = V r t -3 

St=- 



Thus we have two sets of functions, one set to calculate the sine and cosine 
of any angle expressed in radians, and the other to calculate the sine and cosine 
of angles between -2.TT and + ZlT radians. 

The polynomial expansion of sin(z) can be calculated in ten vector instructions 
including eight linked triad instructions for the 64-bit function and in eight 
vector instructions using six linked triad instructions for the 32-bit functions. 

, Tests gave the following results with times given are expressed in units of 
10~ seconds. 
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vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC sine 

.15 

.5 

.64 

.91 

1.72 

3.07 

6.13 

22.98 

64-bit vector 

.49 

.59 

.72 

.98 

1.74 

3.02 

5-59 

14.98 

sine (all angles) 
32 -bit vector 

.42 

.46 

.52 

.63 

.98 

1.57 

2.76 

6.35 

sine (all angles) 
64-bit vector 

.37 

.44 

.53 

.72 

1.27 

2.20 

4.07 

10.04 

sine ( - ITT 
to +XTT ) 
32 -bit vector 

.34 

.37 

.41 

.50 

.75 

1.20 

2.09 

4.78 


sine (~*JT 


to * 2Jtf ) 


vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC cosine 

.3 

.55 

.68 

.99 

2.08 

3.29 

6.68 

23.59 

64-bit vector 
cosine (all 
angles) 

.57 

.60 

.73 

.99 

1.87 

3.19 

5.94 

16.00 

32 -bit vector 
cosine (all 
angles) 

.69 

.47 

.51 

.63 

1.0 

1.70 

2.94 

6.95 

64-bit vector 
cosine (.-ZTT 
to +Z1T ) 

.72 

.45 

.55 

.74 

1.42 

2.40 

4.45 

11.14 

32 -bit vector 

.67 

.37 

.41 

.50 

.77 

1.37 

2.31 

5.51 


cosine ( - ZTT 
to XTT ) 


Thus, we can see that we need a vector length of 500 to 1000 before our 
64-bit routines for »n angles are faster than the CDC supplied routines, but 
that our 32-bit routines for restricted angles between -XTT and +XIT a re 

over four times as fast as the CDC routines for vector lengths of 5000. 

Tangents 


Similarly for the trigonometrical function, y - ta/ix we have supplied 
two sets of functions, one set to calculate the tangent of any angle expressed 
in radians in both 64-bits and the other to calculate the tangent of angles 
between - 2.77" and e £77” radians in both 64-bits and 32-bits. The tangent 
function is calculated using a polynomial expansion of tan(x) to make use of 
linked triads. The calculation is performed by first reducing the argument 
modulo zTT 

Let r, s and -r* = 

IT 

then ? * v, -T t so that 

Now let Si -r, modulo 8 , putting if O 

and 4*s-4 if 
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tan (jo) is now given by 


= ton £*) 

for 

fc-C 7 

- I 

for 

ks 1 

tanC*-i) 

-/ 

for 

k-2 

tan C ») 

-*■ tan, ( x - 1 ) 

for 

k=3 


n 

where tan(t) = ^ c rn w Xin * 1 to the required degree 


«**eo 


Again, if it ia known that the input operand is always 
+ ZH T radians, we can write: 


of precision, 
between - Zl T and 



and so C 4 -r^ < ?• 


In this case * s Tj_ modulo 8 a r t 

Then ^ where 
and k->. r x - + where 

and the calculation continues as before. 

The polynomial expansion of tan(z) is calculated in fourteen vector 
instructions using twelve linked triads. 

The resulting timings of tests are given below, expressed in units of 10 
seconds. 


vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC tangent 

.98 

.73 

.91 

1.47 

2.61 

4.71 

9.33 

30.80 

64-bit vector 
tangent (all 
angles) 

.90 

.82 

.99 

1.35 

2.55 

4.48 

8.4o 

22.67 

32 -bit vector 
tangent (all 
angles) 

.96 

.78 

.90 

1.14 

1.92 

3.21 

5-59 

13.29 

64-bit vector 
tangent (-W 
to +-X7T ) 

.67 

.76 

.93 

1.25 

2.36 

4.14 

7.74 

20.64 

32 -bit vector 
tangent ( -A7T 
to v XlT ) 

.67 

.70 

O 

OO 

a 

.99 

1.76 

2.94 

5.15 

11.98 
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These results show that we need a rector length of only about 200 before 
our 64-bit tangent function for all angles is faster than the CDC routine, and 
that our 32 -bit tangent function for restricted angles between - 2 T and 
+ £JT radians is well orer twice as fast as the CDC routine. 

The Arctangent function 

The arctangent function ^■sa.bafufa) is again calculated from a polyno mi a l 
expansion so that we can use linked triads. The calculation is performed as 
follows: 


For | sc| * 1 

let *? = _!_ 

I*f 

and for j sc / < V? + / 

let ^ - |a-/ 


Change the variable to z, defined by 

* = u> - a. 

2 

4, v a a. 

where, a is chosen so that z = 1.0 when 

Under this condition, a. = 0 - V?)+/ 4 - ZVJT , and is therefore a 

constant. 

Then atan(x) is given by 

atan ( x ) =atan ( z ) +atan ( a ) 

Here, atan(a) is a constant and need only be calculated once, and we may replace 
atan(z) by the truncated series: 

a.b{u\ (*) = y\ 

For fa\ •> ViV I a,traj\fa) - 3" - 

and for as<<9 / Q.baj\fa') = - cl tan fa) 

Atan(z) can be calculated in ten vector instructions, eight of which are 
linked triad instructions. The results are in the range -7f to +77* (not 
inclusive). -L 

. The following results were achieved, times again being given in units of 
10" seconds. 


vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC arctangent 

.38 

.90 

1.19 

1.97 

4.06 

7.35 

16.19 

46.09 

64-bit vector 
arctangent 

.48 

.52 

.66 

.92 

1.91 

3.07 

5.77 

15-23 

32 -bit vector 
arctangent 

.43 

.49 

.55 

.69 

1.10 

1.79 

3.54 

7.27 


These results are spectacular, in that the 32-bit arctangent function is 
over six times as fast as the CDC routine and even the 64-bit version has given 
a threefold increase in speed. 
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Derivation of arcsine and arccosine functions 


The final trigometric routines to be considered calculate the arcsine and 
arccosine of x. The calculations are performed as follows. 

for 0$x< '/z , let * = x so that asin(x) = asin(z) 

and for i c x < l , let z = ( l - * and asin(x) = 7T _ Za*u i (a) 

*• la.^ z 

for -/ ir <0 , asin(x) = asin(-x) and the sane substitutions are used. 

Now the new variable, z, oust be between zero and 0.7 so we may write 


precision. 


to the required degree of 


The arccosine function is derived from the arcsine using the substitution 

0.C4 6 (x.) X T T - OAUlfac) 

Z 

The polynomial expansion of asin(z) is calculated in thirteen vector 
instructions, eleven of which are linked triads. The range of the results for 
arccosine is -TT to + U inclusive, and for arccosine is 0 toTT inclusive. 

X Z 

-k 

The following results were achieved, with times expressed in units of 10 


seconds. 









vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC aeesine 

• 5 

.67 

.87 

1.27 

2.6 

4.73 

9.64 

29.84 

64-bit vector 
arcsine 

.52 

.61 

.75 

1.04 

2.02 

3-55 

6.69 

16.54 

32 -bit vector 
arccosine 

.54 

.51 

.58 

.73 

1.37 

2.25 

3.91 

9.11 

vector length 

10 

50 

100 

200 

500 

1000 

2000 

5000 

CDC arccosine 

.26 

.68 

.89 

1.27 

2.41 

4.35 

9.16 

28.55 

64-bit vector 
arccosine 

.51 

.61 

.76 

1.05 

1.95 

3.44 

6.44 

18.73 

32 -bit vector 
arccosine 

.48 

.54 

.61 

.76 

1.25 

2.07 

3.66 

8.59 


Here our 32-bit functions are over three times as fast as the CDC routines, for 
vector lengths of 5000 . 


Conclusion 


The trigonometrical and logarithmic functions, as provided by CDC up to and 
including version 2.0 of the compiler are, in general, not very efficient. At 
the Meteorological Office, we found it necessary to hand-code these functions in 
vector syntax to take full advantage of the facilities of the Cyber 205. For the 
32 -bit versions, which have a high enough precision for most of our purposes, 
speed increases of up to six times were obtained and even for our 64-bit versions. 
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increases of up to three times are possible. However, CDC have undertaken to 
provide fully vectorized versions of the trigonometrical andlogarithmic functions 
in both 64-bits and 32 -bits by release 2.1 of the compiler. 

The functions described were written in the "special call" syntax because 
of compiler limitations and the difficulties associated with this were partly 
offset by the special features which were then available. Users with the 2.0 
compiler could find that the extra facilities provided by the "special calls" 
do not overcome the difficulties involved with this syntax and that coding 
explicitly in the FORTRAN vector syntax achieves sufficient vectorization for 
their own purposes. 
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